1. INTRODUCTION

Email is one of the most frequently used communication methods in many fields, such as education and
business, as well as for personal matters. By 2025, the number of emails exchanged daily is expected to
exceed 375 billion [1]. This widespread use is due to the effectiveness of email in satisfying users’
needs, in addition to the fact that it is cost-free. Spam emails, or unsolicited emails, typically
advertise drugs, cheap mortgage rates, and items for sale. The average percentage of spam in global email
traffic was 45.67% in Q1 2021 [2]. Sending millions of spam messages (through email and other messaging
systems such as YouTube [3], [4]) is considered a lucrative business because the profit remains large
even if only a very small percentage of recipients respond. These emails cause trouble for both users and
internet services. Beyond being annoying and disruptive, spam significantly reduces work productivity as
users spend time checking and deleting these emails. Spam emails are also widely used to distribute
malware as attachments with the aim of damaging the user’s data. Moreover, spam emails are used in social
engineering attacks to steal users’ confidential data. As for internet services, spam emails impose a
significant cost on the network infrastructure needed to relay this traffic. Hence, a strong protection
mechanism against spam emails is required. Most organizations install and monitor spam filters to block
spam before it ever reaches the user. However, spammers continue to develop new, sophisticated techniques
to circumvent detection by spam filters, which makes protecting email communications from spammers a
challenging task that still requires considerable work and improvement.

Various studies have investigated the detection of spam emails by creating and applying different
machine learning (ML) algorithms and models. Many comparative studies aimed to identify the model that
produces the highest accuracy in detecting spam emails [5]-[7]. The highest accuracy achieved is 94.2%,
obtained using the random forest classifier [7]. To enhance the performance of ML algorithms, many
studies combined them with bio-inspired algorithms or artificial neural networks [8]-[10]. Particle swarm
optimization (PSO) is among the most popular algorithms that have been used in spam email detection.
However, ML techniques cannot handle large datasets of millions of records [11]. In addition, better
results are only achieved after performing data preprocessing and dimensionality reduction, two phases
that require additional effort and processing time.

The main disadvantage of the ML methods is that, contrary to deep learning methods, they are unable to
learn lower-level features. Deep learning (DL) is a subfield of ML that uses neural networks (NN) [12].
It has a strong ability to learn from abstract features, which eliminates the need for the data
preprocessing and dimensionality reduction required by ML algorithms. The data pass through successive
hidden layers such that the information learned in one layer serves the following layers. DL techniques
have demonstrated higher accuracy than traditional ML algorithms, especially when dealing with large
datasets [13]. DL has been used in many applications such as speech recognition, visual recognition, and
drug discovery [14]. Two DL techniques are widely used, namely the recurrent neural network (RNN) and the
convolutional neural network (CNN). These techniques were recently used in spam detection for social
media [13], [15], [16], but have not been widely used in the detection of spam emails. In this paper, we
take advantage of the capabilities of the RNN technique to enhance the detection of spam emails.
Moreover, the proposed method is compared with recent papers that used the SpamBase dataset in order to
evaluate its effectiveness.

This paper is organized as follows. An overview of the related works using DL approaches to spam 
detection is provided in section 2. Section 3 introduces the proposed technique. Section 4 addresses the 
experimental study, presents the results and the discussion. Finally, the conclusion is drawn in section 5. 


2. RELATED WORKS 

Recent studies have shown that deep learning outperforms standard machine learning procedures in
detecting spam messages. Mi et al. [17] demonstrated the power of DL for email spam detection using a
stacked auto-encoder (SAE). Information gain (IG) and bag-of-words (BoW) were used to extract feature
vectors from email samples. The experimental results on different benchmark databases, such as PU1, PU2,
PU3, PUA, and Enron-Spam, showed that the proposed approach achieved 97.02% average accuracy, which
outperformed traditional machine learning algorithms such as naive Bayes (NB), support vector machine
(SVM), decision tree (DT), boosting, random forest, and traditional artificial neural network (ANN). The
authors mentioned that the limitation of SAE is its running time. They encouraged the use of other types
of DL and the investigation of the parameter setting to enhance the performance and reduce the execution
time.

Barushka and Hajek [18] proposed a spam filter combining n-gram TF-IDF feature selection, a deep
multi-layer perceptron NN, and a balancing distribution-based algorithm. The balancing distribution-based
algorithm is used to handle imbalanced datasets. The experiments were conducted on four spam databases:
SpamAssassin, Enron, social networking, and SMS spam. The proposed approach compared favorably with
existing spam filtering techniques such as factorial design using NB and SVM, random forest, minimum
description length, voting, and convolutional neural network (CNN). It achieved an accuracy of 98.76% and
99.89% on the Enron-Spam and SpamAssassin datasets, respectively. The authors mentioned that the proposed
approach cannot be used as an online spam filtering technique due to the high running time caused by the
additional hidden layers.

Chetty et al. [19] proposed a deep learning based spam detection model. This deep model has two
architectures that deal with numeric and text data, respectively. The first architecture consists of an
input layer with 57 nodes, 2 hidden layers with 16 nodes each, followed by a dropout layer, and an output
layer with 1 node. The second architecture comprises a word embedding layer followed by pooling, dense,
and output layers. Both architectures used the rectified linear unit (ReLu) as the activation function
for the dense layer and the Sigmoid function for the output layer. The adaptive moment estimation (ADAM)
optimizer was employed along with the cross-entropy loss function. The model (with the first
architecture) achieved 92.8% accuracy and an 84.9% F1 score on the SpamBase dataset. The authors did not
investigate the running time. However, they mentioned that the model could be improved by exploring the
parameter setting to find the values that enhance its performance. They also advised the use of other
types of DL architectures.

Alauthman [20] introduced a bot spam email detection system using a gated recurrent unit recurrent
neural network (GRU-RNN) with SVM. First, the CART algorithm was employed for feature reduction. GRU was
then used to solve the gradient problem encountered in traditional RNN, allowing a faster training phase.
The author used the ADAM optimizer and the cross-entropy cost function. At the final step of the neural
network, the prediction of the model is computed by employing the decision function of SVM. The approach
showed a detection rate of 98.7% with the minimum number of features on the SpamBase dataset. However,
the DL-related parameters were set randomly, with no experimental analysis, and the execution time was
not reported. The author stated that the proposed model could be further improved by combining it with
other ML techniques.

Sumathi and Pugalendhi [21] introduced a hybrid approach for spam detection using random forest and
a deep neural network (DNN). The random forest algorithm is used to select the important features with
the Gini measure. These features are then used to train a DNN classifier. The authors used
backpropagation with one hidden layer and 10 hidden nodes. They employed softmax as the loss function.
However, they did not investigate the DL parameters to find the best values. Experimental results
obtained on the SpamBase dataset showed that the classification rate of DNN outperformed K-nearest
neighbor (K-NN) and SVM, with an accuracy of 88.59% when only the five top-ranked features were
considered. The authors did not investigate the running time. They encouraged the use of bio-inspired
feature selection algorithms to enhance the accuracy instead of exploring the DL architectures and
parameters.

Hossain et al. [22] proposed ML and DL techniques for a spam detection model. Outliers are first
removed from the dataset using Isolation Forest. The ML techniques were random forest, K-NN, and
multinomial NB, while the DL technique was based on an RNN with one input/output layer. The number of
hidden layers was not specified by the authors. An accuracy of 99.28% was achieved using RNN on the
SpamBase dataset. The authors showed that reducing the feature set can significantly reduce the running
time for ML techniques. However, this was not demonstrated for RNN. They encouraged the use of ensemble
learning for spam detection.

AbdulNabi and Yaseen [23] used word embedding and bidirectional encoder representations from
transformers (BERT) to detect spam in text emails. They employed attention layers to handle the texts.
The parameters were set without a systematic investigation: the number of epochs was set to 3, the batch
size to 32, the learning rate to 4e-5, and the optimizer to ADAM. The loss function was not specified.
Two datasets (HAM and Spam) were merged and used to test and compare the proposed model against some ML
techniques and bidirectional long short-term memory (LSTM). The new model achieved an accuracy of 98.67%.
The authors set the sequence length to 300 tokens or words but stated that the results could be improved
if the sequence length is increased. They also encouraged further exploration of spam detection in
different languages, particularly Arabic.

Baccouche et al. [24] proposed a new DL model to detect spam in social media and emails. They used a
modified version of the multi-label LSTM model and bigrams to handle texts and extract the spam. Many
layers were used, such as the embedding, LSTM, dense, fully connected, and output layers. The softmax
activation function was employed along with the cross-entropy loss function. The related parameters were
defined, but without a deep investigation of the optimal values. Two datasets (Fraud and Spam) were
combined to validate the proposed model. The results demonstrated the effectiveness of this model in
recognizing malicious text regardless of its source. The model achieved an accuracy of 92.7%. However,
the authors did not report its performance in terms of running time.

To sum up, only a few articles handling email spam detection based on DL were found. The authors
employed stacked auto-encoder, deep multi-layer perceptron, gated recurrent unit recurrent neural network,
recurrent neural network, and LSTM. Contrary to the ML-based email spam detection studies (discussed in
the introduction), DL has not been widely explored in this research field. Moreover, spam detection using
DL and ML has mainly been investigated for social media [25] and SMS [26]. This study addresses this gap.
Moreover, it presents a deep investigation of the parameter setting, which was not done in the reviewed
literature.


3. RESEARCH METHODOLOGY 

Deep learning methods have been deployed for the detection of spam messages, especially in social
media. However, only a few studies used deep learning for detecting spam emails. In this paper, the
recurrent neural network is further investigated to improve the accuracy of spam email detection.


3.1. Deep learning approach (DL)

Deep learning is a subfield of machine learning that was developed in the 1980s to enhance the
ANN for processing large datasets, owing to the availability of massive data and powerful computers.
Schmidhuber [12] presented a comprehensive survey on deep learning neural networks (DNN). These DNNs have
many hidden layers that mimic the functionalities and the power of the human brain. The number of hidden
layers differentiates a simple NN from a DNN, as shown in Figure 1 [27].


[Figure 1 panels: Simple Neural Network; Deep Learning Neural Network. Legend: input layer, hidden layer, output layer]


Figure 1. The difference between simple and deep learning neural networks [27]


3.2. Recurrent neural network (RNN) 

In this study, RNN is used to process a large dataset and classify emails as spam or not spam. RNN
is a supervised learning technique that mimics short-term memory, which is located in the frontal lobe of
the brain. The concept behind RNN is that it remembers the knowledge it learned from previous
observations and uses this knowledge as it moves forward. In an RNN, the hidden layers do not just
produce outputs, but also feed their outputs back into themselves, as displayed in Figure 2. The RNN
utilizes the power of short-term memory in processing the data [28].


Figure 2. RNN simple structure 


The neurons are connected to themselves through time. This represents the concept of having a sort
of memory (short-term memory (STM)): the neurons can remember what was in them previously. Each neuron
learns from the previous observations and passes this knowledge to the next neurons. For example, when
RNN is used for translating a sentence, the network needs to remember the translation of each word as it
proceeds forward in order to understand the context and therefore develop an accurate translation. There
are different types of RNN: (a) “one-to-many”, which can be used, for example, when describing an image
using different words; the image is the input while the texts are the output. (b) “many-to-one”, which
can be used in sentiment analysis, such as analyzing whether a text is positive or negative; the input,
in this case, is a group of texts while the output is a single value, positive or negative.
(c) “many-to-many”, which is used in Google Translate, for example, where different words in one language
are translated into another language [9].
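
The feedback idea described above can be illustrated with a minimal many-to-one sketch in plain Python. It is not the implementation used in this study: the function name `rnn_many_to_one`, the weight shapes, the random initialization, and the sigmoid output are all assumptions made for illustration only.

```python
import math
import random

def rnn_many_to_one(sequence, w_in, w_rec, w_out, activation=math.tanh):
    """Run a single-layer RNN over `sequence` and return one scalar score.

    sequence: list of input vectors (one per timestep)
    w_in:     hidden_size x input_size weights applied to the current input
    w_rec:    hidden_size x hidden_size weights applied to the previous state
    w_out:    hidden_size weights mapping the final state to the output
    """
    hidden_size = len(w_rec)
    state = [0.0] * hidden_size  # the "short-term memory", empty at the start
    for x in sequence:
        new_state = []
        for j in range(hidden_size):
            from_input = sum(w * xi for w, xi in zip(w_in[j], x))
            from_memory = sum(w * sj for w, sj in zip(w_rec[j], state))
            new_state.append(activation(from_input + from_memory))
        state = new_state  # fed back into the layer at the next timestep
    logit = sum(w * s for w, s in zip(w_out, state))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid squashes the score into (0, 1)


# Hypothetical toy dimensions and random weights, for illustration only.
random.seed(0)
input_size, hidden_size = 3, 4
w_in = [[random.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
w_rec = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(hidden_size)]

sequence = [[1.0, 0.0, 0.5], [0.2, 0.1, 0.0], [0.0, 0.9, 0.3]]
score = rnn_many_to_one(sequence, w_in, w_rec, w_out)
```

Because the state computed at one timestep enters the computation of the next, the final score depends on the whole sequence and not only on the last input, which is the short-term memory behaviour described in the text.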

RNN learns from the output generated by previous neurons. The time taken for the output to become
the input is called a timestep [29]. The number of inputs equals the number of attributes. In RNN, the
weighted inputs are combined with the previous output before the activation function is applied. This
output becomes the input for the next layer. The number of epochs is the number of passes over all the
training instances, each consisting of one forward pass and one backward pass [30]. The hidden layer
contains a set of neurons whose number is computed using (1).


N_h = \frac{N_s}{\alpha (N_i + N_o)}    (1)


where N_h is the number of hidden neurons, N_i is the number of input neurons, N_o is the number of
output neurons, N_s is the number of samples in the training dataset, and \alpha is an arbitrary scaling
factor between 2 and 10 according to [31].
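
As a worked example of rule (1), the sketch below evaluates it for values taken from the experimental section of this paper (4601 messages, 57 attributes, 80% used for training). The single output neuron, the rounding of the training-set size, and the function name `hidden_neurons` are assumptions made for illustration; the study does not state which value of the scaling factor was used.

```python
def hidden_neurons(n_samples, n_inputs, n_outputs, alpha):
    """Rule of thumb (1): N_h = N_s / (alpha * (N_i + N_o))."""
    if not 2 <= alpha <= 10:
        raise ValueError("alpha is expected to lie between 2 and 10")
    return n_samples / (alpha * (n_inputs + n_outputs))


# Assumed values: 80% of the 4601 SpamBase messages for training,
# 57 input attributes, and a single spam / not-spam output neuron.
n_train = int(0.8 * 4601)  # 3680
for alpha in (2, 5, 10):
    print(alpha, round(hidden_neurons(n_train, 57, 1, alpha), 1))
```

Under these assumptions the rule suggests between roughly 6 and 32 hidden neurons, depending on the scaling factor.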

The hidden layer uses an activation function which transforms the input signal into an output to be 
used in the following layer. The most common types of activation functions are Sigmoid, ReLu, and Tanh. 
The Sigmoid activation function is defined in (2). 


f(x) = \frac{1}{1 + \exp(-x)}    (2)


Its value ranges between 0 and 1 and it is represented as an S-shaped curve. The concept of the Sigmoid
is easy to understand and apply. However, the generated outputs are not zero-centered and are plotted in
a scattered manner, making the optimization harder. Therefore, it causes a slow convergence rate, which
makes it less popular than the other functions [32].

Tanh solves the issue by centering the output at zero, because its values range between -1 and 1.
Hence, the optimization becomes easier [31]. The Tanh formula is defined in (3).


f(x) = \frac{2}{1 + \exp(-2x)} - 1    (3)


The Sigmoid and Tanh functions suffer from the vanishing gradient problem. This problem is associated
with the selected activation function: the gradients of the parameters in the early layers of the NN
become extremely small, causing the accuracy of the prediction to drop [31]. ReLu, however, overcomes the
vanishing gradient problem and has been reported to converge six times faster than the Tanh function,
which makes it the most common activation function used in DL [33]. The formula of ReLu is defined in (4).


f(x) = \max(0, x)    (4)
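
The three activation functions in (2), (3), and (4) can be written directly in Python. The sketch below also includes their standard textbook derivatives, which are not given in this paper and are added only to illustrate how the gradient of the saturating functions shrinks for large inputs.

```python
import math

def sigmoid(x):
    """Equation (2): output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Equation (3): output in (-1, 1), zero-centered."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def relu(x):
    """Equation (4): passes positive inputs, zeroes out the rest."""
    return max(0.0, x)

# Standard derivatives (not part of the original text).
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - tanh(x) ** 2

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# Gradient magnitude at a moderately large input: the two saturating
# functions shrink towards zero, ReLu keeps a gradient of 1.
for name, grad in (("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)):
    print(name, grad(5.0))
```

When many such small gradients are multiplied during backpropagation, the updates reaching the early layers become negligible, which is the vanishing gradient effect discussed above.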


Furthermore, the dropout rate is one of the parameters that can be used to enhance the performance of
neural networks, owing to its ability to mitigate overfitting. It ranges between 0 and 1 according to
[34].
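
To make the role of the dropout rate concrete, the following sketch applies dropout to a list of activations. It assumes the common "inverted dropout" variant, in which surviving activations are rescaled during training; the paper does not state which variant the tool used, and the function name `dropout` and its parameters are hypothetical.

```python
import random

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` while training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Surviving activations are scaled by 1/keep so their expected value is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(42)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.1, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)  # expected to be near the dropout rate
```

With a rate of 0.1, about one activation in ten is silenced at each training step, so the network cannot rely on any single neuron, which is how dropout counteracts overfitting.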


4. RESULTS AND DISCUSSION 

In this study, the RNN was used to detect spam emails in the SpamBase dataset from the UCI
Machine Learning Repository. The dataset contains 4601 email messages described by 57 attributes. Deep
Learning Studio, which is an open-source tool for building deep learning networks, was used to train and
test the model. Multiple experiments were conducted by varying the activation function, the dropout rate,
and the number of epochs, as seen in Figure 3.


[Figure 3 diagram: embedding layer, simple RNN (hidden layer), and dense layer; activation function: Tanh, Sigmoid, or ReLu; dropout rate: 0.1 or 0.2]


Figure 3. RNN structure used in this study


The parameters providing the best results are selected for this experiment. The number of neurons in
the input layer is equal to the number of attributes. Three rounds are performed, each with four runs. In
each round, the activation function is fixed while the number of epochs and the dropout rate are changed.
For the first round, the Tanh activation function is set. For the first two runs, the dropout rate is 0.1
and the number of epochs is set to 50 and 100, respectively. For the two remaining runs in the first
round, the dropout rate is changed to 0.2, and the number of epochs is again set to 50 and 100,
respectively. The same process is repeated in the next two rounds, with a different activation function
in each round. The dataset is divided into three sets of 80%, 10%, and 10% for training, validation, and
testing, respectively. The training and validation accuracy is documented for all the runs. The testing
accuracy is reported only for the best-performing run. The best accuracy obtained by the proposed model
is compared with the accuracy of recent papers that used the same dataset.
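
The experimental design described above can be enumerated with a short script. This is only a sketch of the bookkeeping, not of the training itself: the names `experiment_grid` and `split_sizes` are hypothetical, and rounding the split sizes down is an assumption, since only the percentages are reported.

```python
from itertools import product

ACTIVATIONS = ("Tanh", "Sigmoid", "ReLu")  # one activation function per round
DROPOUT_RATES = (0.1, 0.2)
EPOCHS = (50, 100)

def experiment_grid():
    """Yield (round, run, activation, dropout, epochs) in the order described above."""
    run = 0
    for round_no, activation in enumerate(ACTIVATIONS, start=1):
        for dropout, epochs in product(DROPOUT_RATES, EPOCHS):
            run += 1
            yield round_no, run, activation, dropout, epochs

def split_sizes(n_samples, train=0.8, validation=0.1):
    """Return (train, validation, test) sizes; rounding down is an assumption."""
    n_train = int(n_samples * train)
    n_val = int(n_samples * validation)
    return n_train, n_val, n_samples - n_train - n_val

runs = list(experiment_grid())
sizes = split_sizes(4601)
```

The grid yields 12 runs in total, and under the rounding assumption the 4601 messages are divided into 3680 training, 460 validation, and 461 testing samples.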


4.1. Experimental study 

After implementing the RNN on the SpamBase dataset using different settings, the training and
validation accuracies are listed in Table 1. Among the activation functions applied to the RNN, Tanh
attained the highest results over all runs. The highest values achieved by Tanh were 97% and 99% for
the training accuracy and the validation accuracy, respectively. To validate our model, the testing
dataset was used, on which the model attained an accuracy of 99.7%.


Table 1. Accuracy of RNN using different settings

Round  Run  Activation function  Dropout rate  Epochs  Training accuracy  Validation accuracy
1      1    Tanh                 0.1           50      94%                98%
1      2    Tanh                 0.1           100     97%                99%
1      3    Tanh                 0.2           50      91%                94%
1      4    Tanh                 0.2           100     88%                90%
2      5    Sigmoid              0.1           50      91%                95%
2      6    Sigmoid              0.1           100     91%                91%
2      7    Sigmoid              0.2           50      88%                89%
2      8    Sigmoid              0.2           100     90%                87%
3      9    ReLu                 0.1           50      88%                94%
3      10   ReLu                 0.1           100     97%                96%
3      11   ReLu                 0.2           50      84%                85%
3      12   ReLu                 0.2           100     84%                98%
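
The statement that Tanh performs best over all runs can be checked by averaging the validation accuracy of each round in Table 1. The sketch below does this arithmetic; the averaging is an illustrative summary rather than a statistic reported in this study, and the names `TABLE_1` and `mean_validation_by_activation` are hypothetical.

```python
# (activation, dropout, epochs, training accuracy %, validation accuracy %), from Table 1
TABLE_1 = [
    ("Tanh", 0.1, 50, 94, 98), ("Tanh", 0.1, 100, 97, 99),
    ("Tanh", 0.2, 50, 91, 94), ("Tanh", 0.2, 100, 88, 90),
    ("Sigmoid", 0.1, 50, 91, 95), ("Sigmoid", 0.1, 100, 91, 91),
    ("Sigmoid", 0.2, 50, 88, 89), ("Sigmoid", 0.2, 100, 90, 87),
    ("ReLu", 0.1, 50, 88, 94), ("ReLu", 0.1, 100, 97, 96),
    ("ReLu", 0.2, 50, 84, 85), ("ReLu", 0.2, 100, 84, 98),
]

def mean_validation_by_activation(rows):
    """Average the validation accuracy over the four runs of each round."""
    totals = {}
    for activation, _dropout, _epochs, _train, validation in rows:
        totals.setdefault(activation, []).append(validation)
    return {name: sum(vals) / len(vals) for name, vals in totals.items()}

means = mean_validation_by_activation(TABLE_1)
best_run = max(TABLE_1, key=lambda row: row[4])
```

The mean validation accuracy is 95.25% for Tanh, 93.25% for ReLu, and 90.5% for Sigmoid, and the single best run is Tanh with a dropout rate of 0.1 and 100 epochs.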


4.2. Comparative analysis 

Many works have been carried out to improve the accuracy of spam email detection. Most of these
research studies combined ML models, ANN, or bio-inspired algorithms. Several researchers compared the
different classifiers used in the literature. In Table 2, the result produced by the proposed RNN-DL
approach is compared with recent spam email detection studies that used the SpamBase dataset.

The comparison study comprised three categories. The first category involves the comparison with
ML-based related works. In this part, two well-known ML techniques (rotation forest and Bayesian logistic
regression) [7] were compared with the proposed model. The second category encompasses the comparison
with hybrid studies based on ML and bio-inspired algorithms and/or neural networks. In this part, three
related works were investigated. The first was based on PSO and the decision tree J48 [8], the second
tackled MLP neural network and biogeography-based optimization [9], while the last proposed SVM-based PSO
and MLP [10]. The third category includes the comparison with DL-related works. In this part, four
related works were used for comparison: a modified DL model [19], GRU-RNN with SVM [20], random forest
integrated with deep neural network [21], and RNN with Isolation Forest [22].

As shown in Table 2, the highest accuracy using ML techniques was 94.2%, achieved by random
forest (RF) for predicting emails as spam or not spam. RF is an ML classifier that creates multiple
decision trees based on randomly selected subsets. This method is popular in classification because of
its high accuracy [35]. Even though the ML techniques performed well on the SpamBase dataset, their
accuracy could not beat that of the deep learning neural networks. The majority of the results obtained
using RNN are more accurate than those of the machine learning techniques as well as the multilayer
neural network. An accuracy of 98.7% is the best result previously obtained using a DL algorithm, which
is achieved when combining gated recurrent unit recurrent neural network with SVM and employing the
minimum number of features. In this study, the highest accuracy achieved by the proposed RNN is 99.7%.
This leads to the conclusion that RNN is effective in spam detection, especially when using the Tanh
activation function with 100 epochs and a dropout rate of 0.1.


Table 2. Comparison of results against state-of-the-art methods

Category          Reference  Algorithm(s)                                             Accuracy
Machine Learning  [7]        Rotation Forest                                          94.2%
Machine Learning  [7]        Bayesian Logistic Regression                             93.1%
Bio-inspired      [8]        PSO and the Decision Tree J48                            98.3%
Bio-inspired      [9]        MLP Neural Network and Biogeography Based Optimization   88%
Bio-inspired      [10]       SVM-PSO & MLP for feature selection                      93.07%
Deep Learning     [19]       Modified DL                                              92.8%
Deep Learning     [20]       GRU-RNN with SVM                                         98.7%
Deep Learning     [21]       Random Forest integrated with Deep Neural Network        88.59%
Deep Learning     [22]       RNN with Isolation Forest                                99.28%
Proposed Model    -          Customized RNN                                           99.7%


5. CONCLUSION 

The power of deep learning techniques has been utilized in this study for the detection of spam
emails. RNN is used with different configurations of the activation function, the number of epochs, and
the dropout rate. The highest result is 99.7%, which is obtained with Tanh as the activation function, a
dropout rate of 0.1, and 100 epochs. Moreover, the proposed scheme is compared with other studies that
used the SpamBase dataset. The proposed RNN has been shown to outperform the accuracy of 98.7% achieved
by the approach combining gated recurrent unit recurrent neural network with SVM and employing the
minimum number of features. The future direction of this study is to apply long short-term memory (LSTM)
to spam email classification. LSTM is a particular type of RNN that may generate better results.